EB-140: Investigate if alias files can be used to recode more meaningful scaffold names #74
+125
−0
Recently, we discussed that the scaffold names displayed in the dropdown in the JBrowse session are unnecessarily complex, as they consist of just the sequence accession numbers. It would be desirable to change these to something more self-explanatory or, at least, to the scaffold names used internally by the research groups. For instance, `chr1` is often used as shorthand for chromosome 1 in assemblies that are considered to have chromosome-level completeness.

This PR investigates whether alias files can be used to recode more meaningful scaffold names using a feature update in JBrowse v2.14.0. The new feature introduces a `NcbiSequenceReportAliasAdapter` designed to allow an NCBI `sequence_report.tsv` to be used as a refNameAlias and from there recode the displayed names of the scaffolds in the JBrowse session.

The following experiment focuses on testing the feature in a reproducible manner to evaluate whether it is something we are interested in pursuing. This test implementation is not fully compatible with the current logic of the makefile. Instead, it uses `jq` to adjust the final `config.json` a posteriori, specifying that `NcbiSequenceReportAliasAdapter` should be used instead of `RefNameAliasAdapter`. Specifying this in the initial `config.json` did not work, as it was overwritten when make runs the JBrowse CLI. The `sequence_report.tsv` used in this example is a modified version of the file used in the JBrowse PR, populated with data from the L. tenue assembly. `ABCDE` is a placeholder to check that the column was not used for the aliasing.

Commands:
Result:
The short-form L. tenue scaffold names now display as desired! However, there are some things to consider from this implementation:

- `NcbiSequenceReportAliasAdapter` only supports two scaffold name synonyms to be aliased, which is reasonable given the scope of the original PR. In this case, there were only two synonyms: the ENA-formatted FASTA header and the "Figshare"-formatted FASTA header. But we have previously assumed that we might need three synonyms at times, if we need to display an ENA assembly, an NCBI GFF, and a research group track that use different header formatting together.
- [...] `sequence_report.tsv` to reflect this is ignored in the final session, indicating that the original names take precedence for sorting.
- [...] `LG1` from `ENA|CAMGYJ010000002|CAMGYJ010000002.1` to achieve the new working defaultSession.

In all, my impression at the time of writing is that I'm satisfied that this gives the desired results. The downsides are that it might limit the aliasing to two synonyms and that we would need to implement a non-jq way to pass `'.assemblies[].refNameAliases.adapter.type = "NcbiSequenceReportAliasAdapter"'` to `config.json`.

I would also be happy to post these questions to the JBrowse developers to see what they think. Perhaps a new adapter type would need to be developed to support more synonyms and custom ordering, who knows.
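
As a possible starting point for the non-jq route, here is a minimal Python sketch that applies the same adjustment as the jq filter above. It assumes the final `config.json` has a top-level `assemblies` array, as the filter implies. The function names `set_alias_adapter_type` and `patch_config_file` are hypothetical and not part of this repository or of JBrowse.

```python
import json
from pathlib import Path

# Adapter type named in this PR; everything else here is illustrative.
ADAPTER_TYPE = "NcbiSequenceReportAliasAdapter"


def set_alias_adapter_type(config: dict, adapter_type: str = ADAPTER_TYPE) -> dict:
    """Set refNameAliases.adapter.type on every assembly in a config dict.

    Missing intermediate objects are created, which is also what the jq
    assignment does. Unlike jq, a config without an "assemblies" key is
    left unchanged instead of raising an error.
    """
    for assembly in config.get("assemblies", []):
        aliases = assembly.setdefault("refNameAliases", {})
        adapter = aliases.setdefault("adapter", {})
        adapter["type"] = adapter_type
    return config


def patch_config_file(path: str, adapter_type: str = ADAPTER_TYPE) -> None:
    """Rewrite a config.json in place with the adjusted adapter type."""
    config_path = Path(path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    set_alias_adapter_type(config, adapter_type)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
```

Calling `patch_config_file("config.json")` after make has run the JBrowse CLI would stand in for the jq step. Since the initial `config.json` gets overwritten, the call would have to happen at that late stage, and how best to hook it into the makefile is left open here.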
What do you think @apfuentes? Is the result like you envisioned?
What are your thoughts, @kwentine?